[mlir][xegpu] Add OptimizeBlockLoads pass. #165483
Conversation
akroviakov left a comment
Some preliminary comments
%4:4 = scf.for %arg3 = %c0 to %c256 step %c32 iter_args(%arg4 = %1, %arg5 = %1, %arg6 = %1, %arg7 = %1)
    -> (vector<8x16xf32>, vector<8x16xf32>, vector<8x16xf32>, vector<8x16xf32>) {
  %6 = xegpu.load_nd %3[%c0, %arg3] { layout_result_0 = #b }
    : !xegpu.tensor_desc<32x16xf16, #b, #xegpu.block_tdesc_attr<array_length = 2 : i64>> -> vector<2x32x16xf16>
The #b layout is 2D but the result vector is 3D. I think we should consider just using a 2D shape (32x32 instead of 2x32x16) for the block array load.
That needs some changes to XeGPU op verification and some existing lowering passes, so I would like to keep this test to show the current capability of this pass.
I will do the cleanup for the 3D vector representation in another PR.
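For reference, a minimal sketch of the two result shapes under discussion. The first form is the one used in this PR's test; the 2D variant is hypothetical and, per the reply above, is not accepted by the current verifier/lowerings yet.

```mlir
// Current form: array_length = 2 is surfaced as a leading dimension, so the
// loaded value is 3D.
%res_3d = xegpu.load_nd %3[%c0, %arg3] { layout_result_0 = #b }
  : !xegpu.tensor_desc<32x16xf16, #b, #xegpu.block_tdesc_attr<array_length = 2 : i64>>
  -> vector<2x32x16xf16>

// Suggested 2D alternative (hypothetical): the two 32x16 array blocks are
// folded into a single 32x32 result, matching the 2D #b layout.
%res_2d = xegpu.load_nd %3[%c0, %arg3] { layout_result_0 = #b }
  : !xegpu.tensor_desc<32x16xf16, #b, #xegpu.block_tdesc_attr<array_length = 2 : i64>>
  -> vector<32x32xf16>
```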
// converted.
target.addDynamicallyLegalOp<xegpu::CreateNdDescOp>(
    [&](xegpu::CreateNdDescOp createNdOp) {
      return !canBeOptimized(createNdOp.getType());
Consider using a name like `canLoadTransposed`.
Renamed to `canBeCanonicalizedForTranspose`. I don't want to mention any specific op (like Load) in the name because we are modifying an op sequence here.
VectorType origVectorType =
    VectorType::get(origDataShape, adaptorType.getElementType());
Value data;
// Orig data shape is 3D for the array length case.
I think 2D works better for the array block load result.
Agreed. This will be cleaned up; I will create an issue to track it.
Jianhui-Li left a comment
LGTM
rewriter, loc,
VectorType::get(supportedShape, data.getType().getElementType()),
newTensorDesc, ArrayRef<OpFoldResult>{loadOffsetX, loadOffsetY},
origLoadOp.getPackedAttr(), origLoadOp.getTransposeAttr(),
Nit: Do we still need to set the Pack and Transpose attributes? I think this information is associated with the layout, and we only set them at the WI level when the lane-level layout is dropped.
You are right. This only copies the existing attributes from the f16 load to the i32 load; it has no impact.
Just a fly-by comment: are you not adding the new pass to the pipeline?
Ah, sorry, I was unaware of this requirement. I think I will add it once this is properly e2e tested downstream.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/129/builds/32450. Here is the relevant piece of the build log for reference.
Not a requirement per se, just curious 😀 Ideally, the pipeline should be kept up to date when possible. But no worries if a new pass/transform needs more testing first (standalone or within the pipeline) or might not be universally applicable.
This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that feed into `vector.transpose` into a more optimal form to improve performance. Specifically, low-precision (bitwidth < 32) `LoadNd` ops that feed into transpose ops are rewritten to i32 loads with a valid transpose layout, so that later passes can use the load-with-transpose HW feature to accelerate such load ops.

Update: The pass is renamed to `OptimizeBlockLoads` because we later plan to add the array-length optimization to this pass as well. That optimization will break a larger load (like `32x32xf16`) down into more DPAS-favorable array-length loads (`32x16xf16` with array length = 2). Both optimizations require rewriting `CreateNd` and `LoadNd`, so it makes sense to have a common pass for both.
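For illustration, here is a schematic sketch of the pattern targeted by the transpose rewrite and the kind of IR it is rewritten into. This is hand-written, not taken from the pass's tests: the shapes, value names (`%src`, `%src_i32`, `%c0`), and the omitted layout/attribute plumbing are assumptions.

```mlir
// Before (schematic): a low-precision (f16) block load whose user is a
// vector.transpose. HW transposed loads require 32-bit elements, so this
// pattern cannot use the load-with-transpose feature directly.
%c0   = arith.constant 0 : index
%desc = xegpu.create_nd_tdesc %src : memref<64x64xf16> -> !xegpu.tensor_desc<16x16xf16>
%val  = xegpu.load_nd %desc[%c0, %c0] : !xegpu.tensor_desc<16x16xf16> -> vector<16x16xf16>
%valT = vector.transpose %val, [1, 0] : vector<16x16xf16> to vector<16x16xf16>

// After (schematic): the pass describes the same data with an i32 element
// type (two f16 values packed per i32 along the row, so the inner dimension
// is halved) and loads it as i32, attaching a transpose-capable layout so a
// later pass can lower it to an HW transposed load. The layout attributes and
// the rewiring of the vector.transpose user are omitted here; %src_i32 is a
// placeholder for the i32 view of the source buffer.
%desc_i32 = xegpu.create_nd_tdesc %src_i32 : memref<64x32xi32> -> !xegpu.tensor_desc<16x8xi32>
%val_i32  = xegpu.load_nd %desc_i32[%c0, %c0] : !xegpu.tensor_desc<16x8xi32> -> vector<16x8xi32>
%val_f16  = vector.bitcast %val_i32 : vector<16x8xi32> to vector<16x16xf16>
```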